We present a computational model capable of predicting, above human accuracy, the
degree of trust a person has toward their novel partner by observing the trust-related 
nonverbal cues expressed in their social interaction. We summarize our prior work, in 
which we identify nonverbal cues that signal untrustworthy behavior and also demonstrate 
the human mind's readiness to interpret those cues to assess the trustworthiness of a 
social robot. We demonstrate that domain knowledge gained from our prior work using 
human-subjects experiments, when incorporated into the feature engineering process, 
permits a computational model to outperform both human predictions and a baseline 
model built in naivete of this domain knowledge. We then present the construction 
of hidden Markov models to investigate temporal relationships among the trust-related 
nonverbal cues. By interpreting the resulting learned structure, we observe that models 
built to emulate different levels of trust exhibit different sequences of nonverbal cues. 
From this observation, we derived sequence-based temporal features that further improve 
the accuracy of our computational model. Our multi-step research process presented in 
this paper combines the strength of experimental manipulation and machine learning to 
not only design a computational trust model but also to further our understanding of the 
dynamics of interpersonal trust. 

1. INTRODUCTION

Robots have an immense potential to help people in domains such 
as education, healthcare, manufacturing, and disaster response. 
For instance, researchers have designed robots that take steps 
toward helping children learn a second language (Kanda et al.,
2004), assisting nurses with triage (Wilkes et al., 2010), and par- 
ticipating as part of a search and rescue team (Jung et al., 2013). 
As such robots begin to collaborate with us, we should con- 
sider mediating interpersonal or social factors that can affect the 
outcome of the human-robot team. Methods for incorporating 
pro-social interpersonal factors like trust, friendliness, engage- 
ment, rapport, and comfort, when designed in a way that is 
appropriate across different contexts, can enable socially assis- 
tive robots to develop cooperative relations with their human 
partners. Trust, in particular, has been shown to facilitate more 
open communication and information sharing between people 
(Maddux et al., 2007). Thus, by establishing an appropriate sense
of trust, robots may become more effective communicators and 
thereby increase their capacity to function as collaborative part- 
ners. When designing for such interactions, we need to answer 
how a robot can (1) behave such that humans develop trust 
toward the robot (i.e., the control signal). But to evaluate the 
effectiveness of such behavior, we first ask how a robot can 
(2) evaluate the degree to which an individual trusts the robot 
(i.e., the feedback signal). Considering the development of trust from
the perspective of control systems, a robot can continuously adapt
its behavior to achieve a desired level of trust through a feedback
signal that assesses a person's current level of trust toward the
robot.

This paper focuses on the development of a system that can 
infer the degree of trust a human has toward another social agent. 
For the purposes of the current investigation, we utilize a behav- 
ioral operationalization of trust as a willingness to cooperate for 
mutual gain even at a cost to individual asymmetric gain and even 
when such behavior leaves one vulnerable to asymmetric losses. 
In this paper, trusting behavior represents a person's willingness to
cooperate with his partner and trustworthiness represents his part- 
ner's willingness to cooperate, which the person assesses before 
potentially engaging in trusting behavior. 

We present a computational model capable of predicting — 
above human accuracy — the subsequent trusting or distrusting 
behavior of an individual toward a novel partner, where these 
predictions are based on the nonverbal behaviors expressed dur- 
ing their social interaction. We predict trusting behaviors using 
a machine learning approach. Specifically, we employed super- 
vised learning methods, which present a broad toolset that focuses 
on accurate prediction of the values of one or more output 
variables given the values of an input vector. There is certainly 
much overlap in supervised learning and statistical methods in 
psychology, with many of the same techniques going by dif- 
ferent names, but supervised learning places more emphasis 
on the accuracy of the learned/fit model and permits a wider 
range of modeling techniques, both probabilistic/statistical and 
otherwise. 

Our work is organized into the following three phases:

• Trust-Related Nonverbal Cues (Phase 1): We include a sum- 
mary of our prior work, in which we identify nonverbal cues 
that, when expressed by a social entity (i.e., humans or expres- 
sive robots), are perceived as signals of untrustworthy behavior. 

• Design of Prediction Model (Phase 2): We use the results from 
Phase 1 to inform the design of a computational model capable 
of predicting trust-related outcomes. We compare its predic- 
tion performance to a baseline model that is not informed by 
our Phase 1 results, a random model, an a priori model, and 
the predictions of human participants. 

• Temporal Dynamics of Trust (Phase 3): To improve the accu- 
racy of the prediction model developed in Phase 2, we derive 
additional features that capture the temporal relationships 
between the trust-related cues. With the addition of these fea- 
tures, our computational model achieves significantly better 
performance than all the baseline models as well as human 
judgement. 

2. BACKGROUND 

In situations where a person's past behaviors or reputation are
unknown, we rely on other possible sources of information to
infer a person's intentions and motivations. Nonverbal behaviors
are a source of information about such underlying intentions, 
goals, and values and have often been explored as "honest" 
or "leaky" signals. These signals are primitive social signals 
that are thought to occur largely outside of people's conscious 
control (Pentland, 2008; Ambady and Weisbuch, 2010; Knapp 
and Hall, 2010). Nonverbal behaviors include body language, 
social touch, facial expressions, eye-gaze patterns, proxemics (i.e., 
interpersonal distancing), and vocal acoustics such as prosody 
and tone. Through these nonverbal expressions, we communi- 
cate mental states such as thoughts and feelings (Ambady and 
Weisbuch, 2010). 

Researchers working in domains such as human communica- 
tion modeling and social signal processing have worked toward 
modeling and interpreting the meaning behind nonverbal behav- 
ioral cues (Morency, 2010; Vinciarelli et al., 2012). By observing 
the nonverbal communication in social interactions, researchers 
have predicted outcomes of interviews (Pentland, 2008) and the 
success of negotiations (Maddux et al., 2007). In other work, 
by observing head, body, and hand gestures along with audi- 
tory cues in speech, a hidden conditional random field model 
could differentiate whether a person was agreeing in a debate 
with accuracy above chance (Bousmalis et al., 2011). In similar 
work, Kaliouby and Robinson (2004) used a dynamic Bayesian 
network model to infer a person's mental state of agreement,
disagreement, concentration, interest, or confusion by observing 
only facial expressions and head movements. Other research has 
tried to model cognitive states, like frustration (Kapoor et al., 
2007), and social relations like influence and dominance among 
groups of people (Jayagopi et al., 2009; Pan et al., 2011). However,
this article describes the first work toward computationally 
predicting the trusting behavior of an individual toward a social 
partner. 



To the best of our knowledge, trust-recognition systems
currently exist only in the context of assessing trust and repu- 
tation information among buyers and sellers in online commu- 
nities. By observing transaction histories, consumer ratings of 
sellers, and peer-to-peer recommendations, online services like 
Amazon and eBay utilize these computational models to decide 
whether an online service or product is trustworthy or rep- 
utable in the electronic marketplace (Pinyol and Sabater-Mir, 
2013). 

Research on detecting deception has also taken a behavioral 
approach to computational modeling, focusing on the atypical 
nonverbal behaviors produced by the cognitive effort of concealing
the truth (Meservy et al., 2005; Raiman et al., 2011). Although
related to the concept of trust, research on deception focuses nar- 
rowly on detecting purposeful deception and distinguishing lies 
from truths, whereas this article focuses more broadly on under- 
standing how much an individual trusts another person in more 
natural social encounters. 

We suspect that the absence of work on predicting trust in 
face-to-face social interaction comes in part from the uncer- 
tainty about which nonverbal behaviors contain predictive 
information for trust-related outcomes. Thus, we began our 
investigation with a search for social signals that help pre- 
dict the trustworthiness of an unfamiliar person in a social 
interaction. 

3. TRUST-RELATED SOCIAL SIGNALS (PHASE 1) 

In this section, we summarize the key findings of our prior work 
(for further details see DeSteno et al., 2012), in which we iden- 
tified a set of nonverbal cues that is indicative of untrustworthy 
behavior. We also demonstrated people's readiness to interpret 
those same cues to infer the trustworthiness of a social humanoid 
robot. 

After meeting someone for the first time, we often have a sense
of how much we can trust this new person. In our prior work, 
we observed that when individuals have access to the nonverbal 
behaviors of their partner in a face-to-face conversation, they are 
more accurate in predicting their partner's trust-related behavior 
than when they only have access to verbal information in a web- 
based chat. 

Building from this result, we then investigated which specific 
nonverbal cues signal subsequent trusting or distrusting behavior. 
We hypothesized that the appearance of multiple nonverbal cues 
together, observed in the context of each other, would provide 
such predictive information as opposed to a single "golden cue" 
offering predictive ability. Rather than assuming a one-to-one 
correspondence between a specific behavior and its underlying 
meaning, we viewed the interpretation of nonverbal cues as highly 
context dependent. That is, by observing multiple cues in close 
temporal proximity, we gain more meaningful interpretations 
than by independently assessing the meaning behind a single 
cue. For example, an eye-roll in conjunction with a large grin 
can be more accurately interpreted as conveying humor, whereas 
observing an eye-roll in isolation could lead to the less accurate 
interpretation of contempt. The interpretation of the eye-roll is 
contingent upon the observation of the grin. We operationalized 
this contextual dependency through a mean value of occurrences 
across a set of nonverbal cues seen within an interaction (i.e., the
mean frequency of cues). 

We identified four nonverbal cues — face touching, arms 
crossed, leaning backward, and hand touching — that as a set 
are predictive of lower levels of trust. We found that increased 
frequency in the joint expression of these cues was directly 
associated with less trusting behavior, whereas none of these 
cues offered significant predictive ability when examined in iso- 
lation. In our experiments, trusting behavior was measured 
as participants' exchange action with their partner during an 
economic game (the game details are in section 4.1.3). Thus, 
the more frequently an individual expressed these cues, the 
less trusting was their behavior toward their partner in this 
game. 

To confirm these findings, we validated the nonverbal cue 
set through a human-subjects experiment in which we manipulated
the occurrence of nonverbal cues exhibited by a humanoid
robot. By utilizing a social robotic platform, we took advan- 
tage of its programmable behavior to control exactly which 
cues were emitted to each participant. Participants engaged in 
a social conversation with a robot that either (a) expressed 
neutral conversational gestures throughout the interaction, or 
(b) replaced some of those gestures with each of the four tar- 
get cues (shown in Figure 1). As predicted, the robot's expres- 
sion of the target cues resulted in participants perceiving the 
robot as a less trustworthy partner, both in terms of partici- 
pants' self reports as well as their trusting behavior (i.e., game 
exchange behavior) toward the robot. Thus, when individuals 
observed these cues from the robot, they trusted their robot 
partner less. 

From these human-subjects experiments conducted as part of
our prior work (DeSteno et al., 2012), we extract three key findings
that serve as guidelines to help inform the design of our
computational trust model. 

1. There exists a set of four nonverbal cues associated with lower 
levels of trust: face touching, arms crossed, leaning backward, 
and hand touching. 

2. The joint appearance of cues, represented by their mean frequency,
results in a stronger signal that is predictive of trusting
outcomes, whereas the cues individually possess limited predictive
power.

3. These nonverbal cues predict the trust-related behaviors of an
individual who is either expressing or observing them.

FIGURE 1 | A participant engaging in a 10-min conversation with a
teleoperated humanoid robot, Nexi, here expressing the low-trust cue
of hand touching.

4. DESIGN OF PREDICTION MODEL (PHASE 2) 

In this section, we incorporate the guidelines gained from our 
human-subjects experiments in Phase 1 into the design of a
computational model capable of predicting the degree of trust 
a person has toward a novel partner, which we will refer to as 
the "trust model" for brevity. By utilizing this domain knowl- 
edge in the feature engineering process, we are able to design 
a prediction model that outperforms not only a baseline model 
built in naivete of our prior findings but also outperforms human 
accuracy. 

4.1. MATERIALS (PHASE 2) 

First we describe the data collection material consisting of 
the human-subjects experiment, the operationalization of trust
through an economic exchange game, and the video-coded annotations
of the participants' nonverbal behavior. This data corpus
is used to train and evaluate our trust model. We then describe 
our methods for model design, consisting of our strategies for fea- 
ture engineering, the nested cross-validation method to estimate 
the model's true error, and the learning algorithm and model 
representation selected to create our prediction model. 

4.1.1. Data collection material

We leverage the pre-existing datasets from our human-subjects
experiments in which the task scenario involved two participants
interacting for 5 min and then being asked to make trust judgements
of one another. A total of 56 interaction pairs or 112 people par- 
ticipated in these studies. Some of this data (20 interaction pairs) 
originated from the human-human study described in Phase 1, 
and the remaining are from a separate study. The pool of par- 
ticipants was undergraduates attending Northeastern University 
in Boston, Massachusetts. 31% of the participants were male 
and 69% were female. The data collection materials included 
the raw videos of the human-subjects experiments, video-coded
annotations of the participants' nonverbal behaviors, and trust 
measurements obtained through the Give-Some Game described 
in section 4.1.3. 

4.1.2. Human-subjects experiment 

The experiments consisted of two parts. Participants first engaged 
in a 5-min "get-to-know-you" interaction with another random 
participant (whom they did not know prior to the experiment). 
This part of the study was held in a quiet room, where partici- 
pants were seated at a table as shown in Figure 2. The participants 
were encouraged to discuss anything other than the experiment 
itself. To facilitate conversation, topic suggestions such as "Where 
are you from?" and "What do you like about the city?" were 
placed upon the table on slips of paper. Around the room, three 
time-synced cameras captured the frontal-view of each partici- 
pant along with a side-view of the participants (the perspective 
shown in Figure 2). For the second half of the experiment,
the interaction partners played the Give-Some Game, explained
below. The participants were not told that they would play this
cooperative game with their conversational partner until after the
"get-to-know-you" period was over.

FIGURE 2 | Lab room setup for human-subjects experiment.
Participants engaged in a "get-to-know-you" interaction with another
random participant. Slips of paper on the table listed some conversation
topic suggestions.

www.frontiersin.org

December 2013 | Volume 4 | Article 893 | 3

Lee et al.

Computational trust model

4.1.3. Operationalization of trust

A participant's judgement of trust toward their novel partner was 
behaviorally measured through the Give-Some Game (Lange and 
Kuhlman, 1994). The Give-Some Game is similar to a traditional 
Prisoner's Dilemma game in that it represents a choice between 
self-interested behavior and cooperative behavior (DeSteno et al.,
2010). At the game's start, each player possesses four tokens. Each 
token is worth $1 to the player and $2 in the possession of their 
partner. Each player decides how many of the four tokens to 
give to their partner, keeping the remaining tokens for themselves.
For maximum individual payoff, a player must keep all four
tokens. This strategy ensures that the player receives at least $4 
(while giving nothing to their partner); anything they receive 
from their partner would further increase their payoff. For max- 
imum communal benefit, both of the players would need to 
give away all four tokens to the other, resulting in each player 
earning $8. 
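
The payoff rules above can be checked with a short sketch. The function and constant names are illustrative assumptions, not part of the original study materials; only the token values ($1 kept, $2 received) and the four-token endowment come from the text.

```python
from typing import Tuple

TOKENS_PER_PLAYER = 4   # each player starts with four tokens
VALUE_KEPT = 1          # dollars per token a player keeps
VALUE_RECEIVED = 2      # dollars per token received from the partner


def give_some_payoffs(given_by_a: int, given_by_b: int) -> Tuple[int, int]:
    """Return the dollar payoffs (player A, player B) for one game.

    Hypothetical helper for illustration only.
    """
    for given in (given_by_a, given_by_b):
        if not 0 <= given <= TOKENS_PER_PLAYER:
            raise ValueError("each player gives between 0 and 4 tokens")
    payoff_a = (VALUE_KEPT * (TOKENS_PER_PLAYER - given_by_a)
                + VALUE_RECEIVED * given_by_b)
    payoff_b = (VALUE_KEPT * (TOKENS_PER_PLAYER - given_by_b)
                + VALUE_RECEIVED * given_by_a)
    return payoff_a, payoff_b


# Full mutual cooperation: both give everything, each earns $8.
mutual = give_some_payoffs(4, 4)
# Full mutual defection: both keep everything, each earns $4.
selfish = give_some_payoffs(0, 0)
# A keeps all four while B gives all four: A earns $12, B earns $0.
exploit = give_some_payoffs(0, 4)
```

The third case shows the asymmetric loss that makes giving tokens a trusting act: a player who gives everything to a partner who keeps everything ends with nothing.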

The participants are separated into different rooms to prevent 
the communication of strategies. To limit apprehension about 
having to face a partner to whom they were selfish, participants 
are also told that they will not see each other again. Although 
the game is played individually, the outcome (the money a player 
wins) depends on the decisions made by both the players in the 
game. In the game, players are asked to: 

1. Decide how many tokens they want to give to their partner. 

2. Predict how many tokens they believe their partner will offer 
them. 

In this article, we consider the number of tokens a participant 
gives to represent how much they trust their partner to play
cooperatively. In addition, we consider the discrepancy between
the predicted and the actual number of tokens received to represent
how accurately people can judge the trustworthiness of a
novel partner after a short interaction; this serves as the human
baseline used in section 4.3.1.
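
As a minimal sketch of how such a discrepancy could be summarized, the snippet below averages the absolute difference between predicted and received tokens. Treating the human baseline as a mean absolute difference is an assumption made here for illustration, and the token counts are invented.

```python
def mean_absolute_discrepancy(predicted, received):
    """Average |predicted - received| over participants (hypothetical helper)."""
    if not predicted or len(predicted) != len(received):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted, received))
    return total / len(predicted)


# Invented example: three participants' predictions vs. tokens actually received.
example_predicted = [2, 4, 0]
example_received = [3, 4, 2]
human_error = mean_absolute_discrepancy(example_predicted, example_received)
```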

Rather than assessing purely economic decision making, inter- 
personal or social exchange games like the Give-Some Game 
have been shown by social psychologists to involve evaluations 
of a partner's personality, including his or her trustworthiness. 
For example, Fetchenhauer and Dunning (2012) had participants 
complete a monetary exchange game (where participants were 
given $5 and could gamble it for the chance to win $10). Half 
of the participants completed a social version of the game where 
they were told they were playing with another person, and that their
chance of interacting with a trustworthy person was 46%. The
other half were told they were playing a lottery game (therefore, 
not involving any people) with the same probability of win- 
ning. Participants' behaviors were found to differ significantly 
in the social version of the game compared to the lottery ver- 
sion. That is, while participants made choices in the lottery that 
largely reflected the low probability of winning (46%), partici- 
pants in the social version of the game largely ignored the given 
base rates (i.e., the likelihood one would encounter a trustwor- 
thy person) when making the same economic exchange decision 
and instead made far riskier choices. Thus, participants believed 
that people would behave differently than a purely probabilis- 
tic lottery, and notably, that people would be more trustworthy 
than stated (Fetchenhauer and Dunning, 2012). These findings 
suggest that behavior in social exchange games does not reflect 
purely economic decision making, but includes assessments or 
judgements of the other person involved, including their trust- 
worthiness. 

Further supporting our operationalization of trust, data from 
the past validation experiment in Phase 1 demonstrated that a 
self-report measure of how much a participant trusted a robot 
partner was significantly positively correlated with how many 
tokens that participant decided to give the robot [r(75) = 0.26, 
p < 0.05]. 
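
For readers who want to verify the reported significance, the sketch below converts the correlation into a t statistic using the standard relation t = r * sqrt(df / (1 - r^2)). It assumes that the 75 in r(75) denotes the degrees of freedom, as is conventional; the function name is illustrative.

```python
import math


def t_statistic_for_correlation(r: float, df: int) -> float:
    """t statistic for a Pearson correlation with df = n - 2 degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))


# Values reported in the text: r = 0.26 with 75 degrees of freedom.
t_value = t_statistic_for_correlation(0.26, 75)
```

The result is about 2.33, which exceeds the two-tailed 5% critical value of roughly 1.99 for 75 degrees of freedom (a figure taken from standard t tables, not computed here), consistent with p < 0.05.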

4.1.4. Video-coded annotations

The nonverbal behaviors of the participants over the entire 5-min 
interaction were manually coded. The videos were coded inde- 
pendently by at least two coders who were blind to all hypotheses 
(average inter-rater reliability: ρ = 0.90). Given the high inter-rater
agreement, data from only one coder is used for analysis.
We coded for nonverbal behaviors that appeared frequently (exhibited by at
least five participants) in the video-taped interactions. The start
and stop times of the following behaviors were coded for each par- 
ticipant: smiling, laughing, not smiling, leaning forward, leaning 
backward, not leaning, eye contact, looking away, arms crossed, 
arms open, arms in lap, arms in conversational gesture, arms on 
table, hair touching, face touching, hand touching, body touch- 
ing, no touching, head shaking, head nodding, and head still (see 
visualization and behavior categories in Figure 3). Coders were 
instructed to code the video for only one participant and only 
one nonverbal category at a time. 

FIGURE 3 | Annotated nonverbal behaviors of participants, coded over time.
Gestures within a category are mutually exclusive. Behavior categories:
Lean signal: lean forward, no lean, lean backward.
Touch signal: hair touching, face touching, hand touching, body touching, no touching.
Head signal: head shaking, head still, head nodding.
Smile signal: smiling, not smiling, laughing.
Arm signal: arms crossed, arms open, arms in lap, arms on table, arms in conversational gesture.
Eye signal: eye contact, looking away.



4.2. METHODS FOR MODEL DESIGN (PHASE 2) 

We employ two feature engineering strategies to find a subset 
of features that permits effective learning and prediction. After 
describing our feature engineering process, we detail the training 
and testing procedures used for model selection and model assess- 
ment. These procedures are chosen to assess the differences in the 
predictive power of a model that builds a subset of features using 
domain knowledge as opposed to another model that narrows its 
selection using a popular feature-selection algorithm.

4.2.1. Feature engineering 

Feature engineering encompasses both extracting features that are 
believed to be informative and selecting an effective subset from 
amongst the extracted features. Feature selection is the choosing 
of useful features that are not redundant or irrelevant to create 
a predictive model with high accuracy and low risk for over- 
fitting (i.e., high generalizability). Domingos (2012) points out 
that "feature engineering is more difficult because it's domain- 
specific, while [machine] learners can be largely general-purpose 
. . . the most useful learners are those that facilitate incorporat- 
ing knowledge." We detail the initial full set of features that were 
extracted from our trust corpus, and we compare two strategies 
to narrow our selection of features to an effective subset. We 
first create a model that uses features chosen through a standard 
feature-selection technique called variable ranking. Leveraging 
our findings from Phase 1, we create another model that narrows 
and then extends the initial set of features by following the three 
guidelines listed in section 3. 

Table 1 | The 30 features for the domain-knowledge model, which
narrowed and extended the initial set of features by following the
three guidelines from Phase 1.

4.2.1.1. Feature extraction. From the video-coded nonverbal
annotations, we determined how many times a participant
emitted a particular cue during the interaction (e.g., 25 smile
instances) and how long a participant held the cue through the
duration of the interaction (e.g., for 5% of the interaction, the
participant was smiling). The duration of a gesture provides
additional information and has been used as a feature in other
work (Bousmalis et al., 2011). For instance, if a participant crosses
their arms throughout the entire interaction, then the frequency
of that gesture would register as just one arms crossed. The
duration measure, but not the frequency measure, reflects the
prevalence of arms crossed in this scenario. The full set of features
(42 in total) consists of the frequency and duration for each of the
21 nonverbal cues.
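
The two measures can be sketched from coded start and stop times. This is a minimal illustration that assumes annotations are available as (start, stop) pairs in seconds and that duration is expressed as a fraction of the interaction length; the function name and data layout are hypothetical.

```python
def cue_features(intervals, interaction_seconds):
    """Return (frequency, duration_fraction) for one cue of one participant.

    intervals: list of (start, stop) times in seconds for each coded instance.
    """
    frequency = len(intervals)
    held_seconds = sum(stop - start for start, stop in intervals)
    return frequency, held_seconds / interaction_seconds


# Arms crossed for the whole 5-min (300 s) interaction: one instance, 100% duration.
arms_crossed = cue_features([(0.0, 300.0)], 300.0)
# Two short smiles totalling 15 s: two instances, 5% duration.
smiling = cue_features([(10.0, 15.0), (100.0, 110.0)], 300.0)
```

The arms-crossed case reproduces the scenario in the text: the frequency is only 1, while the duration fraction of 1.0 captures how prevalent the cue was.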

4.2.1.2. Feature selection. Variable ranking is a feature-selection 
method commonly used in practice; this method's popularity can 
be attributed in part to its simplicity and scalability. Variable 
ranking involves independent evaluation of each candidate fea- 
ture. As a consequence, it may choose redundant features and 
does not consider the predictive power of complementary fea- 
tures. Nonetheless, this method has had many reports of success 
in prior work (Guyon and Elisseeff, 2003). In this research, vari- 
able ranking scores each of the 42 features by the absolute value 
of its Pearson correlation coefficient with respect to the number
of tokens given; we then incorporate only the most highly ranked 
features in the trust model. 
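
A minimal sketch of variable ranking as described above: each feature is scored on its own by the absolute Pearson correlation with the number of tokens given, and the top-scoring features are kept. The cutoff `top_k`, the function names, and the toy data are assumptions for illustration; the text does not state how many features were retained.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, tokens_given, top_k):
    """Return the names of the top_k features by |Pearson r| with the label."""
    scores = {name: abs(pearson_r(values, tokens_given))
              for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Toy data with invented feature names: four participants.
toy_columns = {
    "smile_freq": [0, 1, 2, 3],
    "nod_freq": [1, 0, 1, 0],
    "lean_back_dur": [3, 2, 1, 0.5],
}
toy_tokens = [0, 1, 2, 3]
selected = rank_features(toy_columns, toy_tokens, top_k=2)
```

Because each feature is scored independently, two near-duplicate features would both rank highly, which is the redundancy limitation noted in the text.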

In addition to the variable-ranking model, we build a second
predictive model using the domain knowledge gained from our
prior work. The three guidelines derived from Phase 1 inform this
model's feature selection. First, we considered the frequency and
the duration of only the four trust-related cues (face touching,
arms crossed, leaning backward, and hand touching) expressed
by the participant (features $x_1$ to $x_8$ in Table 1) and ignored other
nonverbal behaviors. Second, since cues are more informative in
their joint appearance, the model also uses as features the mean
frequency and mean duration across the four trust-related cues
($x_9$ and $x_{10}$). Finally, the model draws features from the nonverbal
cues not only of the participant but of the participant's partner
as well. The model therefore includes the partner's gestural
frequencies, durations, and mean cues (features $x_{11}$ to $x_{20}$); the
model also includes the differences between the participant's and
their partner's features to capture differences
between their behaviors (features $x_{21}$ to $x_{30}$).
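
The construction of the 30 features can be sketched as follows. The ordering of cues within each block and the sign convention for the differences (participant minus partner) are assumptions made here, since they are not specified in the text; all names and values are illustrative.

```python
TRUST_CUES = ("face_touching", "arms_crossed", "leaning_backward", "hand_touching")


def person_features(freq, dur):
    """10 features for one person: 4 frequencies, 4 durations, and their two means."""
    freqs = [freq[cue] for cue in TRUST_CUES]
    durs = [dur[cue] for cue in TRUST_CUES]
    mean_freq = sum(freqs) / len(freqs)
    mean_dur = sum(durs) / len(durs)
    return freqs + durs + [mean_freq, mean_dur]


def domain_knowledge_features(p_freq, p_dur, q_freq, q_dur):
    """30 features: participant (10), partner (10), participant minus partner (10)."""
    own = person_features(p_freq, p_dur)
    partner = person_features(q_freq, q_dur)
    difference = [a - b for a, b in zip(own, partner)]
    return own + partner + difference


# Invented example values for one participant and their partner.
p_freq = {"face_touching": 2, "arms_crossed": 1, "leaning_backward": 3, "hand_touching": 2}
p_dur = {"face_touching": 0.1, "arms_crossed": 0.2, "leaning_backward": 0.3, "hand_touching": 0.4}
q_freq = {"face_touching": 1, "arms_crossed": 1, "leaning_backward": 1, "hand_touching": 1}
q_dur = {"face_touching": 0.0, "arms_crossed": 0.0, "leaning_backward": 0.0, "hand_touching": 0.0}
features = domain_knowledge_features(p_freq, p_dur, q_freq, q_dur)
```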

These 30 features (listed in Table 1) represent the incorporation
of domain-specific knowledge in the selection process. In the
absence of this knowledge, the alternative method of selection
is to narrow the number of features using variable ranking. The
model that uses this domain-knowledge selection method will
be referred to as the domain-knowledge model, while its naive
counterpart will be referred to as the standard-selection model.

4.2.2. Prediction model 

We aim to demonstrate the effect of incorporating domain- 
knowledge in the feature-selection process on the performance 
of a prediction model. As such, rather than exploring and com- 
paring the predictive accuracies of various machine learning 
algorithms, we focus on support vector machines (SVMs) as the 
primary tool for our feature-focused investigation. SVMs were 
chosen for their wide use and prior success in modeling human 
behavior (Rienks et al., 2006; Kapoor et al., 2007; Jayagopi et al., 
2009). 

SVMs separate training examples into their classes using opti- 
mal hyperplanes with maximized margins; for a query, an SVM 
predicts based on which side of the class boundary the query 
example lies. To find effective separations, the training examples
are transformed from their original finite-dimensional space
into a higher dimensional feature space by a non-linear mapping,
which is computed implicitly through a kernel function (Hastie et al., 2009).

Each of the $m = 112$ training examples in our dataset
(one per participant) contains a vector of $n = 30$ features,
$\vec{x} = (x_1, x_2, \ldots, x_n)$. Features were scaled to have values
within $[-1, +1]$, preventing an over-reliance of the SVM on
large-ranged features. The class label for an example is the
number of tokens the participant gave their partner in the
Give-Some Game, $y \in \{0, 1, 2, 3, 4\}$. With our dataset
$\{(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_m, y_m)\}$, we train and test
our SVM model with a Gaussian kernel using the LIBSVM library
(Chang and Lin, 2011).
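
To make the preprocessing and kernel concrete, the sketch below rescales each feature column linearly to [-1, +1] and evaluates a Gaussian kernel in the form K(u, v) = exp(-gamma * ||u - v||^2), which is the radial basis function form used by LIBSVM. It is a plain-Python illustration with hypothetical function names, not the authors' pipeline, and it does not train an SVM.

```python
import math


def scale_columns(rows):
    """Linearly rescale every feature column of `rows` to the range [-1, +1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no range; map it to 0.0 by convention.
            0.0 if hi == lo else 2.0 * (value - lo) / (hi - lo) - 1.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


def gaussian_kernel(u, v, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Toy example: a small-range and a large-range feature end up on the same scale.
scaled_rows = scale_columns([[0, 10], [5, 20], [10, 30]])
similarity = gaussian_kernel(scaled_rows[0], scaled_rows[1], gamma=0.5)
```

Without the rescaling step, the squared distance inside the kernel would be dominated by whichever feature has the largest numeric range, which is the over-reliance the text refers to.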

4.2.3. Nested cross validation 

To estimate the true prediction error of a model selection pro- 
cess, we employ a nested cross-validation method. A challenge 
when modeling human behavior is collecting and annotating 
enough real-world data to partition into three substantial train- 
ing, validation, and testing sets. The training set is used to fit 
the models. For model selection, the validation set is used in tun- 
ing the model parameters to yield the lowest prediction error. 
For model assessment, the chosen model's prediction error (also 
called generalization error or true error) is estimated using the 
previously unseen testing set. In cases such as ours, when the 
sample size is small (m = 112), the method of cross validation 
(CV) is often used to estimate prediction error by partitioning 
the dataset into subsets, and in multiple rounds, each subset acts 
as the validation or testing set (depending on the analysis) while 
the remainder is used as the training set. One benefit of using
cross validation is that the model can be trained from almost the 
whole dataset. 

When using CV for both model selection and model assess- 
ment, one has to be careful that data involved in the model 
selection process is not reused in the final assessment of the 
classifier, which would occur in the case of first cross validat- 
ing for model selection and then cross validating again with 
the same data for model assessment. When such reuse occurs, 
re-substitution error can falsely lower the estimate of true error
(Hastie et al., 2009). We avoid such misleading results by conducting
nested CV to obtain an almost unbiased estimate of the true
error expected on an independent dataset (Varma and Simon, 
2006). 

We evaluate the trust models through leave-one-out nested CV 
and follow the nested implementation as described by Varma 
and Simon (2006). The leave-one-out nested CV method includes 
an inner loop for model selection and an outer loop for model 
assessment. In each iteration of the outer loop, a training exam- 
ple is removed. With the remaining examples, the best choices 
for hyper-parameters and features (via model selection) are deter- 
mined and used to create a classifier. The resulting classifier then 
predicts the class of the "left out" example, resulting in some 
error. This process is repeated such that each training example 
is left out once. Prediction errors are then accumulated for a final 
mean prediction error (MPE) of the estimator. The MPE is cal- 
culated as the average absolute difference between the classifier's 
predictions and the true class labels. Of note, the nested CV pro- 
cess estimates the MPE of classifiers learned at every iteration of 
the outer loop. This provides a performance measure of an esti- 
mator (i.e., a learning algorithm) and not of a particular estimate 
(i.e., a single classifier). 

In general, the inner loop performs both feature selection and 
hyper-parameter tuning for model selection using the remaining 
data from the outer loop. But in our case, for the variable- 
ranking selection method (described in section 4.2.1.2), a subset 
of features is found before the inner loop, which then conducts
CV for hyper-parameter tuning. More specifically, for our
SVM models, the inner loop tunes the parameters by varying
the values of the model's hyper-parameters C and γ according
to a grid search and chooses the values with the best prediction
error, where the error is calculated by a leave-one-out CV.
The cost parameter C balances the tradeoff between obtaining a
low classification error on training examples and learning a large
class-boundary margin, which reduces over-fitting to the training
examples. In general, increasing the value of C reduces training
error and risks over-fitting. The bandwidth parameter γ controls
the size of the Gaussian kernel's radius, which in effect determines
the smoothness of the boundary contours. Large values
of γ create a less smooth boundary (i.e., higher variance), which
can lead to over-fitting, whereas small values create smoother
boundaries, which can lead to under-fitting. A pseudo-algorithm
detailing the exact steps of our entire procedure is available in the
Appendix.
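
To make the role of γ concrete, the following sketch evaluates a Gaussian kernel of the common form exp(-γ‖x − z‖²) and builds a candidate grid for C and γ. The grid of powers of two is an assumption chosen for illustration; the text does not state which grid was searched. The function and variable names are likewise hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Assumed search grid (powers of two); not taken from the paper.
C_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]
grid = [(C, gamma) for C in C_grid for gamma in gamma_grid]

# A larger gamma makes similarity fall off faster with distance,
# so each training example influences a smaller neighborhood.
x, z = [0.0, 0.0], [1.0, 1.0]
for gamma in (0.03125, 0.125, 2.0):
    print(gamma, rbf_kernel(x, z, gamma))
```

With a small γ, distant examples still look similar to one another and the decision boundary is smooth; with a large γ, only very close examples interact, which allows a more irregular boundary.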

Although cross-validation methods are a standard alternative 
for estimating prediction error when sample sizes are small, they 
have some limitations. In particular, leave-one-out CV can result
in high-variance estimates of the prediction error, since the testing
partition contains only one example. But because it uses almost
the whole dataset for training, the method is also regarded as
achieving low bias (Japkowicz and Shah, 2011). In contrast to
holdout methods (when a substantial independent test set is available),
nested cross-validation does not provide an estimate of the
true prediction error of a particular classifier (i.e., a single trust
model) but instead reports on the average performance of classifiers
built from different partitions of the data (in our case, the
average performance of trust models trained on data sets that
each differ by one example).

4.3. RESULTS AND DISCUSSION (PHASE 2) 

Here we discuss the prediction performance of the two computational
models: the domain-knowledge model (SVM-D) and the
standard-selection model (SVM-S). We then compare these models
to a random model, an a priori model (i.e., a model that always
predicts the most common class), and a human baseline.

4.3.1. Results 

Through leave-one-out nested CV, the SVM-D model is estimated
to have an MPE of 0.74. SVM-D's hyper-parameters have values
of either [C: 8, γ: 0.031] or [C: 2, γ: 0.125] at different iterations
of the outer loop (i.e., across the CV folds for model assessment).
The SVM-S model is estimated to have an MPE of 1.00,
and its hyper-parameters vary more than those of the SVM-D model (see
Figure A1 in the Appendix for hyper-parameter plots).

We statistically assess whether the prediction errors between
classifiers are different through the Wilcoxon signed-rank test
(Japkowicz and Shah, 2011). Since we compare the performance
of the SVM-D model to that of the SVM-S model, random model,
a priori model, and a human baseline, resulting in four statistical
tests, we counteract the increased probability of Type I
error (i.e., claiming a difference where there is none) by adjusting
the significance level to α = 0.05/4 = 0.0125, as per the Bonferroni
correction.
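
As a rough illustration of the two ingredients named above, the sketch below computes a Bonferroni-adjusted significance level and the signed-rank sums on which the Wilcoxon test is based. It assumes a family-wise level of 0.05 (inferred from the adjusted level of 0.0125 over four tests), uses made-up paired errors, and stops at the rank sums; obtaining a p-value would additionally require the null distribution of the statistic, which is omitted here. All names are hypothetical.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return family_alpha / n_tests

def signed_rank_sums(errors_a, errors_b):
    """Return (W_plus, W_minus) for paired samples.

    Zero differences are dropped; tied absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        avg_rank = (pos + end) / 2 + 1          # ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Hypothetical per-example absolute errors of two classifiers on the same examples.
model_a = [0, 1, 0, 2, 1]
model_b = [1, 1, 2, 2, 3]
print(bonferroni_alpha(0.05, 4))
print(signed_rank_sums(model_a, model_b))
```

A small rank sum on one side indicates that one classifier's errors are consistently lower than the other's on the same examples.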

According to this statistical test, our SVM-D model significantly
outperforms the SVM-S model (see Table 2). In comparing
to other baselines, we also found the SVM-D model to significantly
outperform a random model, which uniformly guesses
either 0, 1, 2, 3, or 4 tokens.

In Table 2, the "human" category is not actually a model 
but rather the participants' predictions of how many tokens 
their partner will give them in the Give-Some Game (as men- 
tioned previously in section 4.1.3). These predictions served 
as a human baseline for how accurately people can perceive 
the trustworthiness of a stranger after a 5-min interac- 
tion. The SVM-D model significantly outperforms the human 
predictions. 

Table 2 | The mean prediction error of the SVM-D
(domain-knowledge) model, and its comparison to that of the SVM-S
(standard-selection) model, a priori model, random model, and a
human baseline.

Model       Mean prediction error    T-test (α = 0.0125)
SVM-D       0.74
A priori    0.83                     p = 0.0173
Human       1.00                     p = 0.0011*
SVM-S       1.00                     p = 0.0004*
Random      1.46                     p < 0.0001*

An asterisk symbol denotes statistical significance.

An a priori classifier ignores the nonverbal information from
the features but knows the class distribution (i.e., distribution of
tokens given, as shown in Figure 4). The a priori model always
predicts the class with the lowest mean error: two tokens given.
The SVM-D model outperforms the a priori model but not with
statistical significance.
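
The two non-learning baselines can be illustrated with a short sketch. The a priori model is taken here to be the constant prediction with the lowest mean absolute error over the observed labels, and the random model's expected error is computed for uniform guessing over 0 to 4 tokens. The label counts below are invented for illustration and are not the study's distribution; the function names are hypothetical.

```python
from statistics import mean

def best_constant_prediction(labels, classes=range(5)):
    """Return (class, MPE) for the constant prediction with lowest mean absolute error."""
    scored = [(mean(abs(c - y) for y in labels), c) for c in classes]
    mpe, best = min(scored)
    return best, mpe

def random_model_mpe(labels, classes=range(5)):
    """Expected mean absolute error when guessing uniformly among the classes."""
    classes = list(classes)
    return mean(mean(abs(c - y) for c in classes) for y in labels)

# Hypothetical token counts: one 0, two 1s, four 2s, two 3s, three 4s.
labels = [0] + [1] * 2 + [2] * 4 + [3] * 2 + [4] * 3
print(best_constant_prediction(labels))
print(random_model_mpe(labels))
```

Minimizing mean absolute error with a single constant selects a median of the labels, which is why such a baseline settles on a central class rather than an extreme one.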

4.3.2. Discussion 

By incorporating the guidelines derived from Phase 1 into the
feature engineering process, we designed a prediction model
(SVM-D) that outperformed not only the model built in naivete
of those guidelines (SVM-S) but also human
judgement.

Of note, participants gave an average of 2.58 tokens yet pre- 
dicted that they would receive an average of 2.28 tokens. We 
believe this bias toward predicting a less generous return con- 
tributed to the error in the human predictions. Thus the SVM-D, 
SVM-S, and the a priori model all have an added advantage of 
being biased toward the majority class. 

Our SVM-D model outperformed the a priori model, which 
ignores nonverbal behavior data. However, the difference was 
not significant, and so we cannot yet say with confidence that 
the nonverbal data improved the prediction performance of our 
modeling algorithm. 

By identifying where our SVM-D model is making predic- 
tion errors, we can aim to find additional features that can help 
discriminate between the examples the model is most confused 
about. According to SVM-D's confusion matrix (Table 3), the 
model has difficulty distinguishing when an individual has a 
higher degree of trust toward their partner. For people who gave 
four tokens, the model generally predicts that two tokens will be 
given, contributing 55% of the total prediction error. We there- 
fore seek to improve upon the SVM-D model in Phase 3 by 
deriving new features that can help differentiate individuals with 
greater levels of trust toward their partners, ultimately result- 
ing in significantly more accurate predictions than the a priori 
model.

FIGURE 4 | The distribution of tokens given by participants. The
majority (41%) gave two tokens. An a priori model based on this
distribution will always predict two tokens.

Table 3 | Confusion matrix for SVM-D revealing the model having
difficulty distinguishing when an individual has a higher degree of
trust toward their partner.

5. TEMPORAL DYNAMICS OF TRUST (PHASE 3)

In this final phase, we improve upon the accuracy of the predic- 
tion model developed in Phase 2 by deriving additional sequence- 
based temporal features. 

In the last part of our investigation, the model built with 
features chosen through domain knowledge (SVM-D) out- 
performed a model unaware of this knowledge (SVM-S). 
Additionally, SVM-D was more accurate than the a priori model, 
but not significantly so. To improve the performance of our 
SVM-D model, we again turn to domain knowledge to guide our 
search for new features. 

As mentioned in Phase 1, we hypothesized that the appearance
of multiple nonverbal cues together, observed in the context
of each other, would provide reliable information about trusting
outcomes. We operationalized this contextual joint appearance
of the trust-related nonverbal cues through their mean
frequency and mean duration as features for our trust model
(described in section 4.2.1.2). These mean values attempt to
roughly capture the temporal proximity of the trust-related cues
occurring within a social interaction. We extend our hypothesis
to another form of operationalizing "context" through the
sequence of emitted cues. We anticipate that the contextual information
given in the sequence of trust-related nonverbal cues
contains predictive information not captured by the cues' mean
frequencies or durations. Furthermore, rather than only observing
the sequence of the four nonverbal cues associated with lower
levels of trust (low-trust cues), we aim to observe their interplay
with nonverbal cues associated with higher levels of trust
(high-trust cues), since we anticipate more discriminating patterns
to emerge from the dynamic of both high-trust and low-trust
cues.

In Phase 3, we first describe the redesign of our experiment, 
from which we identify a set of high-trust nonverbal cues. We 
then present the construction of hidden Markov models (HMMs) 
to find temporal relationships among the new high-trust cues 
and the previously identified low-trust cues. From this tempo- 
ral model, we derive additional features that improve upon the 
performance of SVM-D from Phase 2. 

5.1. MATERIALS (PHASE 3) 

In this subsection, we describe the additional human- subjects 
experiment and analysis performed to identify a set of high-trust 
nonverbal cues. 



5.1.1. Human-subjects experiment redesign

We redesigned our previous human-subjects experiment
(detailed in section 4.1) with one key difference: after the social 
interaction and before the game, the participants were told that 
the average number of tokens given is one. However, participants 
were also told that the average may not be indicative of what their 
particular partner would do and were encouraged to let their 
social interaction guide their predictions. 

In the experiments that produced our original data (section 
4.1), participants decided how many tokens to give their partner 
without any prior knowledge of the expected or "normal" giv- 
ing behavior. These participants may have varied in their belief of 
how many tokens are thought to be unfairly low, fair, and over- 
generously high. By introducing this manipulated information 
about the average tokens given, we aim to shift the expecta- 
tion of token-giving behavior such that individuals expect the 
mean level of giving to be lower (i.e., closer to one token on 
average compared to 2.28 tokens on average from the previous 
experiment). 

This manipulation had two intended outcomes. First, we aimed
to lessen participants' variation in their beliefs of normal giving
behaviors, a potential source of noise in our original data,
and thereby increase the likelihood of a good fit between
certain nonverbal cues and token-giving outcomes.
Second, by biasing participants toward a lower norm of one token,
we aimed to shift "norm-followers" away from higher levels of giving.
Norm-followers are those without preference in wanting to
give more or less to their partner and therefore give according to
the behavioral norm. We speculate that participants who then give
more than one token are most likely those trusting their partners
to play cooperatively and therefore willing to deviate from the
established norm. Having shifted the norm-followers away from
higher levels of giving, we then expect to have more homogeneity
in the variance among the nonverbal behaviors that predict higher
levels of trust.

Through this manipulation, we anticipate finding a set of nonverbal
cues that are significant predictors of trusting behavior in
a positive direction, which we were unable to identify with the
original dataset from Phase 2.

A total of 16 interaction pairs (i.e., 32 participants), again 
undergraduate students at Northeastern University, participated 
in this redesigned study (41% male and 59% female). The two 
independent coders of the videos for this new study were also 
blind to all hypotheses, including any knowledge of our previ- 
ous findings. Each coder coded half of the videos from the study. 
To establish inter-rater reliability, each coder also coded a subset
of the videos originally coded by the other independent coder
(ρ = 0.93).

5.1.2. Identifying high-trust nonverbal cues

To identify a set of nonverbal cues that are indicative of higher lev- 
els of trust, we employ the same procedure from our prior work. 
We briefly outline the procedure here (for more detail see DeSteno 
et al., 2012). We begin by examining the zero-order correlation 
between the frequency of a nonverbal cue emitted and the amount 
of tokens given. We identify which of the 22 nonverbal cues (i.e., 
those listed in section 4.1.4 with the addition of hand gesturing)
positively correlate with the number of tokens given. None of the
correlations were both positive and significant individually, so we 
again considered sets of cues. We chose candidate sets through 
an ad hoc examination of correlation coefficients and p-values, 
and we tested their joint predictive ability through a multilevel 
regression analysis that controls for dyadic dependencies. As we 
hypothesized, we were able to find a set of cues — leaning forward, 
smiling, arms in lap, and arms open — that positively and signifi- 
cantly predicted token-giving outcomes in the Give-Some Game 
(β = 0.11, p < 0.04).

Concluding that this cue set is indeed predictive of higher levels
of trust would require experimental manipulation (as in the
robotic experiment described in Phase 1) to confirm that this relationship
is not merely the result of spurious correlation. But our
primary interest is deriving new features that capture temporal
relationships between particular cues. We therefore continue to
refer to the set of cues (leaning forward, smiling, arms in lap, and
arms open) as indicative of higher levels of trust.

5.2. METHODS FOR MODEL DESIGN (PHASE 3) 

Suspecting that the sequence of nonverbal cues contains pre- 
dictive information not captured by the cues' mean frequencies 
or durations, we built a temporal model, capable of modeling 
processes over time. More specifically, we constructed HMMs to 
capture the temporal relationship between the newly identified 
high-trust cues and the previously identified low-trust cues using 
the original dataset from Phase 2 (section 4.1). Of note, the data 
from the redesigned experiment (section 5.1.1) is only used for 
the identification of the high-trust cues. 

HMMs are common representations for temporal pattern 
recognition. However, HMMs are commonly viewed as having 
low interpretability; the model's internal complexity hinders any 
qualitative understanding of the relationship between the input 
and predicted outcomes (Hastie et al., 2009). Although often 
treated as a black box technique, HMMs are capable of finding 
structures that reveal interesting patterns in the data. Our tech- 
nique described below demonstrates one method for leveraging 
an HMM's learned structure to derive new features. 

5.2.1. Temporal model 

In applications that have a temporal progression, HMMs represent
a sequence of observations in terms of a network of hidden
states, from which observations are probabilistically created. State
$s_t$ at time $t$ is drawn probabilistically from a distribution conditioned
on the previous state $s_{t-1}$, and an observation $o_t$ is drawn
probabilistically conditioned on the current state $s_t$. Thus, to model
a temporal process by an HMM, three probability distributions
must be learned: over initial states, $P(s_0)$; over state transitions,
$P(s_t \mid s_{t-1})$; and over observations, $P(o_t \mid s_t)$. In this work,
the parameters for these distributions are iteratively adjusted 
by expectation maximization (Rabiner, 1989) to maximize the 
likelihood of the observation sequences in the data given the 
parameterized model. Possible observations were the eight high- 
and low-trust cues: smiling, leaning forward, leaning backward, 
hand touching, face touching, arms open, arms crossed, and arms 
in lap. Based on cues' coded start times, we extracted the sequence 
of only these eight gestures for each participant during their
5-min interaction. The sequence length per participant varied
(min = 9, max = 87), since some individuals gesticulated these 
cues more often than others. 
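
Extracting such a sequence from coded annotations amounts to filtering for the eight cues and ordering by start time. The sketch below assumes a hypothetical annotation format of (start time in seconds, cue name) pairs; the actual coding format is not described in the text.

```python
TRUST_CUES = {
    "smiling", "leaning forward", "leaning backward", "hand touching",
    "face touching", "arms open", "arms crossed", "arms in lap",
}

def cue_sequence(annotations):
    """Order one participant's trust-related cues by coded start time.

    annotations: iterable of (start_time_seconds, cue_name) pairs (assumed format).
    """
    relevant = [(t, cue) for t, cue in annotations if cue in TRUST_CUES]
    return [cue for _, cue in sorted(relevant)]

# Hypothetical annotations for one participant; "hand gesturing" is not one of the eight cues.
coded = [(12.5, "smiling"), (3.0, "arms crossed"),
         (7.2, "hand gesturing"), (20.1, "face touching")]
print(cue_sequence(coded))
```

Because only the order of the cues is kept, participants who gesture more often simply produce longer sequences.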

Once the initial state, state transition, and state observation 
probabilities are selected, a trained HMM can generate a sample 
sequence of observations. The simulation first generates a sample 
state path (i.e., Markov chain) of a given length by selecting an 
initial state drawn from the initial state probability distribution 
and selecting subsequent states based on the transition probabili- 
ties. Given a sample state path, an observation is drawn based on 
each of the states' observation probabilities to then form a sample 
sequence of observations. 
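
The sampling procedure just described can be sketched as follows. The parameter layout (nested dictionaries of probabilities), the state names, and the toy parameter values are all assumptions made for illustration; they are not the learned parameters of the models in this work.

```python
import random

def sample_hmm(initial, transition, emission, length, rng):
    """Draw one observation sequence of the given length from an HMM.

    initial:    {state: P(s_0 = state)}
    transition: {state: {next_state: P(next_state | state)}}
    emission:   {state: {observation: P(observation | state)}}
    """
    states = list(initial)
    state = rng.choices(states, weights=[initial[s] for s in states])[0]
    observations = []
    for _ in range(length):
        symbols = list(emission[state])
        weights = [emission[state][o] for o in symbols]
        observations.append(rng.choices(symbols, weights=weights)[0])
        successors = list(transition[state])
        weights = [transition[state][s] for s in successors]
        state = rng.choices(successors, weights=weights)[0]
    return observations

# Toy two-state model with made-up probabilities.
initial = {"A": 0.6, "B": 0.4}
transition = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emission = {"A": {"smiling": 0.8, "arms crossed": 0.2},
            "B": {"smiling": 0.3, "face touching": 0.7}}
print(sample_hmm(initial, transition, emission, length=9, rng=random.Random(0)))
```

Passing an explicit random number generator keeps a simulated sequence reproducible, which is convenient when comparing the outputs of two models side by side.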

When a model has many states, transition paths, and possible 
observations per state, deciphering the meaning of a state and 
its role in the state network is especially difficult. However, by 
simulating a trained HMM, we can qualitatively examine the 
generated observation sequence for informative patterns. To find 
discriminative patterns, we trained one HMM_low from participants
that gave two tokens away and another HMM_high from participants
that gave four tokens away and then searched for differences in their
sample sequences of emitted nonverbal cues. Two tokens was chosen
for HMM_low because few participants gave 0 or 1 token (leading
to insufficient data for training). By comparing a sample sequence
of observations from HMM_low and HMM_high, we can discover
any informative distinctions between their simulated outputs. For
example, we could observe certain patterns of nonverbal cues that
appear in succession in HMM_low's output that do not appear in
HMM_high's. We can then use these unique and differentiating
patterns to construct new features.

5.2.2. Leave-one-out cross validation 

We determined the best model parameters for HMM_low and
HMM_high via leave-one-out CV. A nested method was unnecessary,
since we aimed only to draw insight from the trained model
and not to estimate the true prediction error.

From our original dataset from section 4.1, we have 26 training
examples for HMM_high and 46 training examples for HMM_low.
We ran 6000 simulations using the Bayes Net Toolbox (Murphy,
2001). At every run we randomly initialized (drawn from a uniform
distribution) the number of states and the initial state,
transition, and observation probabilities for both of the HMMs.
To determine the prediction performance of the models' parameters,
we use a leave-one-out CV method, where we leave the data
from one participant out and train on the remaining 71 participants
for the two HMMs. The omitted example is classified as
either high or low trust, determined by which of HMM_low or
HMM_high has a higher log-likelihood of creating the observations
in the example.
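
The classification rule in the last sentence can be illustrated with the standard scaled forward algorithm, which computes the log-likelihood of an observation sequence under an HMM. This is a minimal sketch rather than the toolbox routine used in the study; the parameter layout, the toy single-state models, and the tie-breaking toward "low" are assumptions.

```python
import math

def log_likelihood(obs, initial, transition, emission):
    """Scaled forward algorithm: log P(obs | HMM) for a non-empty sequence."""
    states = list(initial)
    alpha = {s: initial[s] * emission[s].get(obs[0], 0.0) for s in states}
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = {s: emission[s].get(obs[t], 0.0)
                        * sum(alpha[r] * transition[r].get(s, 0.0) for r in states)
                     for s in states}
        scale = sum(alpha.values())
        if scale == 0.0:
            return float("-inf")            # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}   # rescale to avoid underflow
    return log_lik

def classify(obs, hmm_low, hmm_high):
    """Label a sequence by the model with the higher log-likelihood (ties go to 'low')."""
    high = log_likelihood(obs, *hmm_high)
    low = log_likelihood(obs, *hmm_low)
    return "high" if high > low else "low"

# Toy single-state models over '+' (high-trust cue) and '-' (low-trust cue).
hmm_low = ({"s": 1.0}, {"s": {"s": 1.0}}, {"s": {"+": 0.3, "-": 0.7}})
hmm_high = ({"s": 1.0}, {"s": {"s": 1.0}}, {"s": {"+": 0.7, "-": 0.3}})
print(classify(list("++-"), hmm_low, hmm_high))
```

Rescaling the forward variables at every step keeps the computation stable for the longer sequences, where the raw probabilities would otherwise become vanishingly small.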

5.3. RESULTS AND DISCUSSION (PHASE 3) 

Below we interpret the resulting learned structure of our HMMs 
and discuss the ability of the newly derived temporal features to 
improve on the prediction accuracy of the SVM-D model from 
Phase 2. 

5.3.1. Results 

Through leave-one-out CV, our best model training result, with
three states for HMM_low and five states for HMM_high, has a
recognition accuracy of 71% (51 hits and 21 misses) compared to
64% at chance due to the uneven distributions. Our goal, however,
is not discriminative accuracy of these generative HMM
models but rather to identify what interesting temporal patterns
they capture. By simulating the trained HMM_high, we get an
output observation of:

smiling -> smiling -> face touch -> smiling -> smiling ->
hand touch -> arms in lap -> arms crossed -> arms in lap ...

And by simulating HMM_low, we get an output observation of:

smiling -> arms crossed -> face touch -> hand touch ->
smiling -> face touch -> arms crossed -> smiling -> smiling ...

To make the pattern easier to decipher, we denote (+) as high-trust
and (-) as low-trust cues to form:

HMM_high = + + - + + - + - + ...
HMM_low  = + - - - + - - + + ...

Both models alternate between high-trust and low-trust cues.
HMM_low's sequence contains frequent consecutive low-trust cues,
whereas HMM_high's sequence contains more consecutive high-trust
cues. As posited previously, these observation sequences
suggest that the order of nonverbal cues may provide further
information for predicting trust-related outcomes.
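
The sign encoding used above is a simple lookup. The sketch below groups the eight cues as stated in the text (leaning forward, smiling, arms in lap, and arms open as high-trust; the remaining four as low-trust) and applies it to the two sample sequences listed above. The function name and the use of the full cue names "face touching" and "hand touching" are choices made for this illustration.

```python
HIGH_TRUST = {"smiling", "leaning forward", "arms in lap", "arms open"}
LOW_TRUST = {"face touching", "hand touching", "arms crossed", "leaning backward"}

def encode(cues):
    """Map a cue sequence to a string of '+' (high-trust) and '-' (low-trust)."""
    return "".join("+" if cue in HIGH_TRUST else "-"
                   for cue in cues if cue in HIGH_TRUST or cue in LOW_TRUST)

sample_high = ["smiling", "smiling", "face touching", "smiling", "smiling",
               "hand touching", "arms in lap", "arms crossed", "arms in lap"]
sample_low = ["smiling", "arms crossed", "face touching", "hand touching",
              "smiling", "face touching", "arms crossed", "smiling", "smiling"]
print(encode(sample_high))
print(encode(sample_low))
```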

We use these findings to derive new features for our prediction 
model. Since SVMs do not naturally capture temporal dynamics, 
we represent the ordering of low- and high-trust gestures emit- 
ted using encoding templates. That is, by stepping through the 
sequence, we count the number of times (i.e., the frequency) in 
which we observe the following templates: 

low-trust  { - - - , - - + , + - - , + - + }

high-trust { + + + , + + - , - + + , - + - }

With a sliding window of three cues, these templates in essence 
profile the neighboring cues (ones right before and right after a 
particular cue). 
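
Counting templates with a sliding window of three can be sketched as below. The sketch rests on one reading of the description above, namely that a template is a window of three cues whose middle cue has the named type (low-trust or high-trust), with the neighbors on either side free to vary. The function name and the sign-string input are hypothetical.

```python
from collections import Counter

def template_counts(signs, center):
    """Count length-3 windows of a '+'/'-' string whose middle symbol equals `center`."""
    counts = Counter()
    for i in range(len(signs) - 2):
        window = signs[i:i + 3]
        if window[1] == center:
            counts[window] += 1
    return counts

signs = "+---+--++"                      # an encoded cue sequence
low = template_counts(signs, "-")        # windows centered on a low-trust cue
high = template_counts(signs, "+")       # windows centered on a high-trust cue
print(dict(low))
print(dict(high))
```

A sequence shorter than three cues yields no windows, so all of its template frequencies are zero.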

We considered adding the following new feature types to our
trust model: (1) high-trust templates, (2) high-trust cues, and
(3) low-trust templates. When adding the frequencies of the high-trust
templates as features to our model, the MPE increased to 0.80
as compared to the previous SVM-D's MPE of 0.74 in Phase 2.
When instead adding the features (that are analogous to the
ones listed in Table 1) of the four high-trust nonverbal cues
(leaning forward, smiling, arms in lap, and arms open), the model
again increased in error with an MPE of 0.83. We therefore
included neither the high-trust templates nor the four high-trust
nonverbal cues in the final selection of features for our trust
model.

We created 12 new features, consisting of the frequencies in
which the low-trust templates are emitted by the participant, their
partner, and the difference in frequency between them (shown in
Table 4). Through the inclusion of the low-trust template features
toward the training of a final model, our new trust model achieves
an overall MPE of 0.71, which now significantly outperforms the
a priori model (see Table 5).

Table 4 | Twelve new features, consisting of the frequencies in which
the low-trust templates are emitted by the participant, their partner,
and the difference in frequency between them, used to train the final
SVM model.

Table 5

Model       Mean prediction error    T-test (α = 0.0125)
SVM-D       0.71
A priori    0.83                     p = 0.0049*
SVM-S       0.86                     p = 0.0018*
Human       1.00                     p = 0.0003*
Random      1.46                     p < 0.0001*

Of note, to maintain fair comparisons to a model not using domain knowledge,
the SVM-S model uses the full 42 features detailed in section 4.2.1.1 (thus,
not needing the variable ranking to find a smaller subset). An asterisk symbol
denotes statistical significance.
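
The twelve low-trust template features described above can be assembled from per-person template counts as sketched below. The set and order of the four templates, the direction of the difference (participant minus partner), and all names are assumptions made for illustration.

```python
# Assumed low-trust templates: windows of three cues centered on a low-trust cue.
LOW_TEMPLATES = ["---", "--+", "+--", "+-+"]

def low_trust_template_features(own_counts, partner_counts, templates=LOW_TEMPLATES):
    """Build 3 x len(templates) features: participant, partner, and their difference."""
    own = [own_counts.get(t, 0) for t in templates]
    partner = [partner_counts.get(t, 0) for t in templates]
    diff = [a - b for a, b in zip(own, partner)]   # assumed sign: participant minus partner
    return own + partner + diff

# Hypothetical template counts for one participant and their partner.
own_counts = {"---": 1, "--+": 2, "+--": 2}
partner_counts = {"+-+": 3}
print(low_trust_template_features(own_counts, partner_counts))
```

Keeping the participant's and the partner's counts as separate features, alongside their difference, lets the classifier weigh each person's behavior independently as well as the imbalance between them.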

Our final trust model consists of 42 features listed in
Tables 1 and 4. To better understand the contribution of different
components of trust signals, we performed an additional analysis
to study the effects of removing particular categories of features
on the trust model's performance. As shown in Figure 5, the MPE
increases when excluding certain categories of features. When
removing duration-type features (features $x_5 \ldots x_8$, $x_{15} \ldots x_{18}$,
and $x_{25} \ldots x_{28}$ listed in Table 1) from the full set of 42 features, the
trust model's performance is most heavily affected. This suggests
that the duration, or prevalence, of a gesture provides important
information for the trust model. Interestingly, removing
information about the partner's nonverbal behaviors has greater
effects than removing the behavioral information of the individual
whose trusting behavior we are trying to predict. This may
suggest that when predicting the trusting behaviors of an individual,
rather than directly observing their nonverbal behavior
for "honest" or "leaky" signals, it is more informative to observe
their partner whose behaviors greatly influence the individual's
decision to trust.

FIGURE 5 | The effects of excluding categories of features on the trust
model's mean prediction error. The legend lists the exact features of a
category that were excluded, and their descriptions can be found in
Tables 1 and 4.

5.3.2. Discussion 

The inability of the high-trust cues and templates to enhance the 
predictive power of our model is not surprising in light of the dif- 
ferences between the study from which the high-trust cues were 
identified and the original studies whose data is used to train and 
test our model. The experiments that provided our original data 
collection material (in section 4.1) were conducted in a friendly 
and prosocial context; as shown in Figure 4, the number of tokens 
participants tended to give away fell on the high or trusting end 
of the distribution. When the default expectation is cooperation 
(i.e., most participants give away high numbers of tokens), then 
those that deviate from this expectation are most likely the par- 
ticipants that did not trust their partner to play cooperatively; 
this scenario is the direct opposite of the one described in sec- 
tion 5.1.1. Thus, it is not surprising that the nonverbal cues we 
identified as being most predictive in these experiments were neg- 
ative predictors related to lower levels of giving. In our context 
where the behavioral norm is for people to be more cooperative or 
trusting, the high-trust cues and templates lose predictive power. 
In line with what we observed, when adding the high-trust cues 
and templates as potential features, the trust model's predictive 
performance decreased. 

5.4. GENERAL DISCUSSION 

Our research sought to answer how a robot can (1) behave such 
that humans develop trust toward the robot (i.e., the control sig- 
nal) and (2) evaluate the degree to which an individual trusts the 
robot (i.e., feedback signal). Our prior work in Phase 1 demon- 
strated the capacity of a social humanoid robot to influence a 
person's trust toward the robot through either the presence or 
the absence of specific nonverbal gestures. Our current work in
Phases 2 and 3 demonstrates the capacity of a computational trust
model to evaluate the degree to which a person trusts another 
social agent. An important next step is to answer how a robot can 
dynamically adapt its behavior to achieve or maintain a desired 
level of trust based on its continual reading and interpretation of 
a person's current level of trust toward the robot. However, before 
the computational trust model will readily work for a social robot 
in a wide range of real-world interactions, the model's current 
limitations will need to be addressed. We discuss three limiting 
dependencies of the model below. 

In its current implementation, the model relies on hand-coded
annotations of the nonverbal behaviors. For a robot to determine
how much an individual trusts the robot, it will need to recognize
these gestures autonomously. To model these gestures, 3D motion
capture technology (like the Microsoft Kinect) can track the body
movements of people, and gesture recognition algorithms can
detect when particular nonverbal cues are being expressed. This
low-level gesture recognition system can then feed into a high-level
trust recognition system, which will be primarily driven by
the trust model [see Lee (2011) for an initial framework].

Secondly, the trust model relies on the behavioral operationalization
of trust through the Give-Some Game. A difficult but
important question to consider is the game's ability to measure 
real-world interpersonal trust. In section 4.1.3, we provided sup- 
port that behavioral trust games like the Give-Some Game do 
not seem to purely assess economic decision making but instead 
involve social evaluations of other players. We also found that 
subjective measures of trust (via self report) were significantly 
positively correlated with participants' monetary decisions in the 
Give-Some Game. This suggests that the Give-Some Game is cap- 
turing behavior that is related to trust behaviors in the real world, 
and thus we expect that the current model can generalize to pre- 
dict other measures of trust or trusting behavior, particularly in 
situations similar to those in our experiments. However, future 
studies exploring how cue selection will be altered by changes 
in context (e.g., in situations where the default expectation is 
for others to be untrustworthy) will be necessary to expand the 
predictive ability of the current model to new contexts. 

Lastly, the model relies on the contextual constraints that are 
implicit to the laboratory setting. The data gathered in these 
experiments was based on undergraduate students around the 
age of 18-22 attending Northeastern University in Boston, MA. 
The participants met unfamiliar partners (a fellow student affil- 
iated with the same school), in a lab space (not a natural social 
setting), and for a short 5-min conversation. Given that the inter- 
pretation of nonverbal cues is highly context dependent, factors 
such as age, culture, group membership, and social environ- 
ment, which are largely specified in the lab setting used in our 
experiments, can influence how an individual interprets trust- 
related social signals. The trust model is context dependent in that 
it has no information about these factors in its representation. 
Therefore, the model performs accurately when making trust 
judgements in the setting in which its training data originated. If 
we were to use this model to determine how much an interviewer 
trusted an interviewee, we can anticipate a drop in performance. 
However, a model that incorporates contextual knowledge as variables
can generalize to a greater variety of situations. Similarly,
by incorporating other communication modalities such as facial
expression, prosody, and verbal semantics, the model's predictive
accuracy will most likely improve.

6. CONCLUSION 

We developed a computational model capable of predicting the 
degree of trust a person has toward a novel partner — as oper- 
ationalized by the number of tokens given in the Give-Some 
Game — by observing the trust-related nonverbal cues expressed 
in their social interaction. 

Our work began with a study to demonstrate that when peo- 
ple have access to the nonverbal cues of their partner, they are 
more accurate in their assessment of their partner's trustworthi- 
ness. Confident that there is trust-related information conveyed 
through nonverbal cues, we furthered our investigation to iden- 
tify and confirm a set of four nonverbal cues indicative of lower 
levels of trust and demonstrated people's readiness to interpret 
these same cues to infer the trustworthiness of a social humanoid 
robot. Through these studies, we drew three important guidelines 
that informed the design of our computational trust model. We 
demonstrated that by utilizing this domain knowledge in the fea- 
ture engineering process we could design a prediction model that 
outperforms a baseline model built in naivete of this knowledge. 

We highlighted the importance of representing an individual's
nonverbal dynamics, i.e., the relationship and influence of
behaviors expressed sequentially by an individual. We represented
the context within which an individual's nonverbal behaviors appear
in two ways. The temporal dependency was first operationalized
through the mean value of occurrence of the trust-related
nonverbal cues expressed within an interaction. Then, at a finer
temporal granularity, we investigated the sequential interplay of
low-trust and high-trust nonverbal cues through the construction
and simulation of hidden Markov models. Through the inclusion
of new sequence-based temporal features, our computational
trust model achieves a prediction performance that is significantly 
better than our baseline models and more accurate than human 
judgement. 
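As a rough illustration of how simulating a hidden Markov model can surface candidate sequence-based features, the sketch below samples cue sequences from a small discrete HMM and tallies the most frequent three-cue patterns. Everything in it is an assumption made for illustration: the two states, all probabilities, and the cue labels `cue_a`, `cue_b`, `cue_c` are made up, and counting contiguous trigrams is only one simple way to summarize the simulated sequences, not necessarily the procedure used in this work.

```python
import random
from collections import Counter


def sample_hmm(start, trans, emit, length, rng):
    """Draw one emission sequence of `length` symbols from a discrete HMM."""
    def draw(dist):
        # Pick one key of `dist` with probability proportional to its value.
        return rng.choices(list(dist), weights=list(dist.values()))[0]

    state = draw(start)
    sequence = []
    for _ in range(length):
        sequence.append(draw(emit[state]))  # emit a cue from the current state
        state = draw(trans[state])          # then move to the next hidden state
    return sequence


def ngram_counts(sequence, n):
    """Count contiguous length-n subsequences of cues."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))


# Hypothetical two-state model; every number below is invented for illustration.
start = {"s0": 0.6, "s1": 0.4}
trans = {"s0": {"s0": 0.7, "s1": 0.3},
         "s1": {"s0": 0.4, "s1": 0.6}}
emit = {"s0": {"cue_a": 0.5, "cue_b": 0.4, "cue_c": 0.1},
        "s1": {"cue_a": 0.1, "cue_b": 0.3, "cue_c": 0.6}}

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
totals = Counter()
for _ in range(200):
    totals += ngram_counts(sample_hmm(start, trans, emit, 20, rng), 3)

# The most frequent simulated patterns are candidates for sequence features.
top_patterns = totals.most_common(3)
print(top_patterns)
```

In practice one would fit a separate model to interactions from each trust level and compare which patterns are frequent under one model but rare under the other.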

Our multi-step research process combined the strength of 
experimental manipulation and machine learning to not only 
design a computational trust model but also to deepen our
understanding of the dynamics of interpersonal trust. Through
experimental design and hypothesis testing, we were able to nar- 
row the wide field of variables to consider for our prediction 
model to the empirically found trust-related nonverbal cues. And 
by constructing a machine learning model capable of revealing 
temporal patterns, we discovered that the sequence of nonver- 
bal cues a person emits provides further indications of their trust 
orientation toward their partner. This intersection of method- 
ologies from social psychology and artificial intelligence research 
provides evidence of the usefulness of interdisciplinary tech- 
niques that push and pull each other to advance our scientific 
understanding of interpersonal trust. 
APPENDIX



Algorithm 1 Leave-one-out Nested Cross Validation

Hyperparameters: C, γ
Candidates for C: {2^-5, 2^-3, ..., 2^15}
Candidates for γ: {2^-15, 2^-13, ..., 2^3}
m = # of examples
n = # of features
x = {x_1, x_2, ..., x_m}, where x is an m x n matrix

function modelAssessment
    for k = 1, 2, ..., m do
        Set test = {x_k}
        Set train = x \ {x_k}
        model = modelSelectAndTrain(train)
        error_k = SVMpredict(test, model)
    end for
    mpe = (1/m) Σ_k error_k

function modelSelectAndTrain(train)
    train = featureSelection(train)
    for c in range of C do
        for g in range of γ do
            for j = 1, 2, ..., m - 1 do
                Set train_eval = {train_j}
                Set train_fit = train \ {train_j}
                model_j = SVMtrain(train_fit, c, g)
                error_{c,g,j} = SVMpredict(train_eval, model_j)
            end for
            mpe_{c,g} = (1/(m - 1)) Σ_j error_{c,g,j}
        end for
    end for
    [C*, γ*] = argmin_{c,g} (mpe_{c,g})
    model* = SVMtrain(train, C*, γ*)
    return model*
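The nested procedure in Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: the SVM is replaced by a hypothetical stand-in learner (`smoother_fit` and `smoother_predict`, an RBF-weighted average in which 1/C acts as a shrinkage term), the error is assumed to be the absolute prediction error, the feature selection step is omitted, and the toy data are invented. Only the loop structure and the hyperparameter grids follow the algorithm.

```python
from itertools import product
from math import exp


def smoother_fit(train, c, g):
    """Hypothetical stand-in for SVMtrain: stores the data and hyperparameters."""
    return (train, c, g)


def smoother_predict(model, x):
    """RBF-weighted average of training targets, shrunk toward 0 by 1/c."""
    train, c, g = model
    weights = [exp(-g * sum((a - b) ** 2 for a, b in zip(x, xi)))
               for xi, _ in train]
    weighted_sum = sum(w * yi for w, (_, yi) in zip(weights, train))
    return weighted_sum / (sum(weights) + 1.0 / c)


def mean_loo_error(data, c, g, fit, predict):
    """Inner loop: mean absolute leave-one-out error of (c, g) on `data`."""
    errors = []
    for j, (xj, yj) in enumerate(data):
        fit_set = data[:j] + data[j + 1:]  # train \ {train_j}
        errors.append(abs(predict(fit(fit_set, c, g), xj) - yj))
    return sum(errors) / len(errors)


def model_select_and_train(train, c_grid, g_grid, fit, predict):
    """Pick (C*, gamma*) with the lowest inner error, then refit on all of train.

    Feature selection, which Algorithm 1 runs here on the training set only,
    is omitted from this sketch.
    """
    best_c, best_g = min(
        product(c_grid, g_grid),
        key=lambda cg: mean_loo_error(train, cg[0], cg[1], fit, predict))
    return fit(train, best_c, best_g), (best_c, best_g)


def model_assessment(data, c_grid, g_grid, fit, predict):
    """Outer loop: hold out each example once and average the test errors."""
    errors, chosen = [], []
    for k, (xk, yk) in enumerate(data):
        train = data[:k] + data[k + 1:]  # x \ {x_k}
        model, best = model_select_and_train(train, c_grid, g_grid, fit, predict)
        errors.append(abs(predict(model, xk) - yk))
        chosen.append(best)
    return sum(errors) / len(errors), chosen


# Invented toy data: one feature per example, target is half the feature value.
data = [((float(i),), 0.5 * i) for i in range(8)]
c_grid = [2.0 ** e for e in range(-5, 16, 2)]   # 2^-5, 2^-3, ..., 2^15
g_grid = [2.0 ** e for e in range(-15, 4, 2)]   # 2^-15, 2^-13, ..., 2^3

mpe, chosen = model_assessment(data, c_grid, g_grid, smoother_fit, smoother_predict)
print("mean prediction error:", mpe)
print("distinct (C, gamma) pairs selected:", sorted(set(chosen)))
```

Because the held-out example never enters the inner loop, the outer error estimate is not biased by hyperparameter selection. Collecting `chosen` across outer folds gives the kind of per-fold hyperparameter record summarized in Figure A1.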



[Figure A1 plot, "Hyper-parameter values across CV folds": log2(γ) on the vertical axis (ticks from -16 to 4) against log2(C) on the horizontal axis (ticks from 0 to 18). Legend: SVM-S (Phase 2), SVM-D (Phase 2), SVM-D (Phase 3), SVM-S (Phase 3).]

FIGURE A1 | This plot shows the hyper-parameter values C and γ
selected (not showing repetitions) at different iterations of the outer 
cross-validation loop for the SVMs built in Phases 2 and 3. SVM-D (in 
Phases 2 and 3) and SVM-S (in Phase 3) have relatively stable 
hyper-parameter values across cross-validation folds, while SVM-S (in 
Phase 2) can vary between eight sets of values. 